## [1] "Number of reads: 204399"
## [1] "Execution time: 2.066465 hours"
HVR can be delimited if V and J segments and their conserved Cys and Phe codons are identified in the read. This is the case for most of the reads (98.1%). Also, it is more common for Cys and Phe codons to be in frame (Frame 0).
Only 1.9% of the reads lack of at least one of the four mentioned elements. These reads are discarded from the analysis.
Detailed results of the vj_alignment function are shown in the following tables:
| Status | # Reads |
|---|---|
| HVR found | 200445 |
| HVR not found | 3954 |
| Status | # Reads |
|---|---|
| Frame 0 | 171402 |
| Frame 1 | 11106 |
| Frame 2 | 17937 |
| No Cys | 1297 |
| No Cys no Phe | 35 |
| No J | 918 |
| No Phe | 1643 |
| No V | 61 |
Reads with delimited HVR (200445 reads) can be grouped into unique clonotypes (same V and J match and HVR sequence). This results in 43699 unique reads:
There are 990 VJ families in the sample. A heatmap with the number of unique HVR sequences per VJ family is shown here:
140 VJ families with only one HVR sequence are excluded from the clustering step. These families are shown here:
Therefore, 850 VJ families undergo the clustering.
Affinity propagation (AP) is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data point as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges. More information about APC and the 5 approaches tested here could be found in HVR_clustering_sample_111H notebook.
Total number of exemplars is shown at the bottom of each table (number of entries).
q = 0Minimum p value (tends to underestimate the number of clusters).
q = 0.25q = 0.5Median p value (tends to overestimate the number of clusters).
q = 0.75Between median and maximum p value (tends to overestimate the number of clusters).
q = 1Maximum p value (tends to overestimate the number of clusters).
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
pSpecify p for each HVR based on abundance (n) modified to be in the same range as the similarity matrix s. Proposed formula:
\[ p_{i} = min(s) + \left(\frac{(max(s) - min(s)) · (n_{i} - min(n))}{max(n) -min(n)}\right) \]
sModify s according to Zhang et al paper (section 1.2.1).
\[ s'_{(i,j)}=\begin{cases} -n_{i}\: d_{(i,j)}^{2} & \text{ if } i\neq j \\ s^{*} & \text{ otherwise } \end{cases} \]
where \(d\) is the distance between sequences and \(s^{*}\) is a small constant called preference (in our case the principal diagonal is 0).
MiXCR unique clonotypes: 7020 (multiple reads) + 1084 (one read) = 8104
The APC model that yields the closest number of exemplars would be q = 0.5 (N = 7023). Our best model according to the supervised measurements was "Adjust p" (N = 3387).
For the clustering step we use 53000 reads that MiXCR discarded.
Dendrograms of 850 VJ families with more than one distinct HVR sequence were plotted for manual annotation (all files available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/dendrograms_sample_111L_weighted_hclust_wardD). The method used was a weighted hierarchical clustering with ward.D linkage, as it showed to be the best linkage method for sample 111H.
~80 dendrograms were randomly selected for manual annotation. The Google Docs presentation with the annotated dendrograms (work in progress) is here.
For clustering models evaluation and comparison, "true" exemplars were manually selected for 75 (out of 87, work in progress) randomly selected VJ families in sample 111L. Selected exemplars on dendrograms can be seen here. The following table contains a summary of the manual exemplar selection (with comments in special cases):
With this "ground truth", we can evaluate the clustering models in terms of supervised classification metrics such as:
Accuracy: ratio of correctly predicted observation to the total observations. \[accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]
Recall (sensitivity or true positive rate): ratio of correctly predicted positive observations to the all observations in actual class. \[recall = \frac{TP}{TP + FN}\]
F1 score: weighted average of precision and recall (this score takes both false positives and false negatives into account).
\[F_{1} = 2 * \frac{recall · precision}{recall + precision}\]
q for each VJ familyThe results with 111L are not as good as with sample 111H, which makes us think that the differences in diversity directly affect the clustering process. Since q parameter (indirectly) determines the number of exemplars, q should be higher in larger/more diverse families to choose more exemplars. The problem is to estimate the family diversity in an uncorrected set of reads. Three estimators are proposed:
The following figures are interactive and show the relationship between classification metrics (accuracy, recall, precision, F1) and the three proposed estimators. Performance metrics can be hidden or shown by clicking them on the figure legend.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 40 rows containing non-finite values (stat_smooth).
A diversity order, \(q [0, 3]\), should be specified. \(q = 2\) was chosen arbitrarily.
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 40 rows containing non-finite values (stat_smooth).
¿Which functional diversity q value gives the best correlation between functional diversity and true number of HVRs?
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 4 row(s) containing missing values (geom_path).
hillR package¿Is there a relationship between any of the estimators and the optimal q parameter?
Let's check which q value gives the best model for each family and see if there is a correlation with unique HVR / Shannon's Index / Functional diversity. The chosen metric to evaluate the models is F1, since our dataset is very imbalanced (low number of exemplars or true positives and large number of non-exemplars or true negatives).
It seems that functional diversity presents the best correlation with the optimal q parameter.
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
Y axis was log10-scaled for better visualization.
Y axis was log10-scaled for better visualization.
In a similar way that we adjusted p for each HVR based on its frequency (bringing the frequency to the similarity matrix values range), we could adjust q for a VJ family bringing its functional diversity value to the range 0-1. Since this transformation is lineal and a lineal correlation with q was observed with log10-transformed functional diversity values, here the values should be log10-transformed as well.
Weighted (q) is the second best model in terms of accuracy and F1 (q=0.75 is the best one).
Histograms of Levenshtein distances distribution for each VJ family are available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/levenshtein_distance_histograms_111L/.
From Fenias-Moiane et al 2.2 Section:
Adaptive Affinity Propagation (AAP) algorithm is an AP variant developed to deal with issues related with setting the initial preference value p which is determined by taking the minimum or the median of the similarity S(i,j), and occurrence of oscillations in the original AP algorithm. The choice of preference influences on the final resulting number of clusters, and the oscillation may often cause the process to fail to reach the convergence.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Rcpp_1.0.5 tidyr_1.1.2 vegan_2.5-6
## [4] lattice_0.20-38 permute_0.9-5 plotly_4.9.0
## [7] reshape2_1.4.4 DT_0.16 apcluster_1.4.8
## [10] WeightedCluster_1.4-1 cluster_2.1.0 TraMineR_2.2-0.1
## [13] bsselectR_0.1.0 stringr_1.4.0 ggtree_2.0.4
## [16] plyr_1.8.6 ggpubr_0.4.0 ggplot2_3.3.2
## [19] webr_0.1.6 dplyr_1.0.2 msa_1.16.0
## [22] Biostrings_2.52.0 XVector_0.24.0 IRanges_2.18.3
## [25] S4Vectors_0.22.1 BiocGenerics_0.30.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 uuid_0.1-4
## [3] backports_1.2.0 Hmisc_4.2-0
## [5] systemfonts_0.3.2 lazyeval_0.2.2
## [7] splines_3.6.1 crosstalk_1.1.0.1
## [9] BiocParallel_1.18.1 GenomeInfoDb_1.20.0
## [11] digest_0.6.27 htmltools_0.5.1.1
## [13] magrittr_1.5 checkmate_2.0.0
## [15] openxlsx_4.2.3 readr_1.4.0
## [17] matrixStats_0.57.0 officer_0.3.15
## [19] jpeg_0.1-8.1 colorspace_1.4-1
## [21] haven_2.3.1 xfun_0.19
## [23] RCurl_1.98-1.2 crayon_1.3.4
## [25] jsonlite_1.7.1 rrtable_0.2.1
## [27] survival_3.2-7 zoo_1.8-8
## [29] ape_5.4-1 glue_1.4.2
## [31] polyclip_1.10-0 rvg_0.2.5
## [33] gtable_0.3.0 zlibbioc_1.30.0
## [35] DelayedArray_0.10.0 sjmisc_2.8.5
## [37] car_3.0-10 DEoptimR_1.0-8
## [39] abind_1.4-5 scales_1.1.1
## [41] rstatix_0.7.0 miniUI_0.1.1.1
## [43] viridisLite_0.3.0 xtable_1.8-4
## [45] htmlTable_2.1.0 tmvnsim_1.0-2
## [47] tidytree_0.3.3 foreign_0.8-72
## [49] Formula_1.2-4 vcd_1.4-8
## [51] htmlwidgets_1.5.2 httr_1.4.2
## [53] RColorBrewer_1.1-2 acepack_1.4.1
## [55] ellipsis_0.3.1 pkgconfig_2.0.3
## [57] farver_2.0.3 nnet_7.3-12
## [59] sass_0.3.1 labeling_0.4.2
## [61] tidyselect_1.1.0 rlang_0.4.11
## [63] later_1.1.0.1 munsell_0.5.0
## [65] cellranger_1.1.0 tools_3.6.1
## [67] generics_0.1.0 devEMF_4.0-2
## [69] sjlabelled_1.1.7 moonBook_0.2.3
## [71] broom_0.7.6 evaluate_0.14
## [73] fastmap_1.0.1 yaml_2.2.1
## [75] knitr_1.30 robustbase_0.93-6
## [77] zip_2.1.1 purrr_0.3.4
## [79] nlme_3.1-140 mime_0.9
## [81] xml2_1.3.2 compiler_3.6.1
## [83] rstudioapi_0.11 curl_4.3
## [85] png_0.1-7 ggsignif_0.5.0
## [87] treeio_1.8.2 tibble_3.0.4
## [89] tweenr_1.0.1 bslib_0.2.4
## [91] stringi_1.5.3 highr_0.8
## [93] forcats_0.5.0 gdtools_0.2.2
## [95] Matrix_1.2-17 psych_2.0.9
## [97] vctrs_0.3.8 pillar_1.4.6
## [99] lifecycle_0.2.0 BiocManager_1.30.10
## [101] editData_0.1.2 lmtest_0.9-38
## [103] jquerylib_0.1.3 bitops_1.0-6
## [105] data.table_1.13.2 insight_0.10.0
## [107] flextable_0.5.11 ztable_0.2.2
## [109] GenomicRanges_1.36.1 httpuv_1.5.4
## [111] hwriter_1.3.2 R6_2.5.0
## [113] latticeExtra_0.6-29 ShortRead_1.42.0
## [115] promises_1.1.1 gridExtra_2.3
## [117] rio_0.5.16 boot_1.3-23
## [119] MASS_7.3-51.1 SummarizedExperiment_1.14.1
## [121] shinyWidgets_0.5.4 withr_2.3.0
## [123] Rsamtools_2.0.3 GenomicAlignments_1.20.1
## [125] mnormt_2.0.2 GenomeInfoDbData_1.2.1
## [127] mgcv_1.8-28 hms_0.5.3
## [129] grid_3.6.1 rpart_4.1-15
## [131] rmarkdown_2.7 rvcheck_0.1.8
## [133] carData_3.0-4 ggforce_0.3.1
## [135] Biobase_2.44.0 shiny_1.5.0
## [137] base64enc_0.1-3